Correcting BLAST e-Values for Low-Complexity Segments

Authors

  • Itai Sharon
  • Aaron Birkland
  • Kuan Chang
  • Ran El-Yaniv
  • Golan Yona
Abstract

The statistical estimates of BLAST and PSI-BLAST are critical for determining the biological relevance of sequence matches. Although very effective in evaluating most matches, these estimates usually overestimate the significance of matches in the presence of low-complexity segments. In this paper, we present a model, based on divergence measures and statistics of the alignment structure, that corrects BLAST e-values for low-complexity sequences without filtering or excluding them and generates scores that are more effective in distinguishing true similarities from chance similarities. We evaluate our method and compare it to other known methods using the Gene Ontology (GO) knowledge resource as a benchmark. Various performance measures, including ROC analysis, indicate that the new model improves upon the state of the art. The program is available at biozon.org/ftp/ and www.cs.technion.ac.il/~itaish/lowcomp/.
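The abstract refers to low-complexity segments and divergence measures without defining them on this page. As a purely illustrative sketch, and not the authors' model, the Python below computes two standard quantities of that kind: the Shannon entropy of a sequence's residue composition (low values suggest low complexity) and the Jensen-Shannon divergence between two compositions. The choice of Jensen-Shannon divergence and all function names here are assumptions made for illustration.

```python
import math
from collections import Counter


def composition(seq):
    """Relative frequency of each residue in seq (hypothetical helper)."""
    counts = Counter(seq)
    total = len(seq)
    return {aa: n / total for aa, n in counts.items()}


def shannon_entropy(seq):
    """Shannon entropy in bits of the residue composition of seq."""
    return -sum(p * math.log2(p) for p in composition(seq).values())


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two distributions (dicts)."""
    keys = set(p) | set(q)
    # Midpoint distribution; nonzero wherever p or q is nonzero.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    low = "QQQQQQQQQQPQQQQQ"       # made-up low-complexity segment
    diverse = "MKTAYIAKQRQISFVK"   # made-up compositionally diverse segment
    print(shannon_entropy(low), shannon_entropy(diverse))
    print(js_divergence(composition(low), composition(diverse)))
```

A segment dominated by one residue has entropy near zero, while a compositionally diverse segment scores higher; the divergence is zero for identical compositions and reaches one bit when the two compositions share no residues.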

Related articles

Composition-modified matrices improve identification of homologs of Saccharomyces cerevisiae low-complexity glycoproteins.

Yeast glycoproteins are representative of low-complexity sequences, those sequences rich in a few types of amino acids. Low-complexity protein sequences comprise more than 10% of the proteome but are poorly aligned by existing methods. Under default conditions, BLAST and FASTA use the scoring matrix BLOSUM62, which is optimized for sequences with diverse amino acid compositions. Because low-com...

Application of compression-based distance measures to protein sequence classification: a methodological study

MOTIVATION: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. RESULTS: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algor...

Sequence complexity of disordered protein.

Intrinsic disorder refers to segments or to whole proteins that fail to self-fold into fixed 3D structure, with such disorder sometimes existing in the native state. Here we report data on the relationships among intrinsic disorder, sequence complexity as measured by Shannon's entropy, and amino acid composition. Intrinsic disorder identified in protein crystal structures, and by nuclear magnet...

Lossless Data Compression with Error Correcting Codes

Existing zero-error variable-length data compression algorithms suffer from sensitivity to transmission errors and error propagation across packets when used in packet-based data transmission through noisy channels. We propose a new approach to lossless data compression based on error correcting codes and the block-sorting transform. It has long been recognized that linear source codes achieve ...

Low-complexity iterative QRD-M detection algorithm for V-BLAST systems

Introduction: In wireless communications, the Vertical Bell Labs Layered Space-Time (V-BLAST) system has been considered as an efficient architecture to achieve very high data rates in rich scattering environments [1]. The great potential of the V-BLAST system can be achieved with maximum likelihood detection (MLD) in the receiver. However, the complexity of MLD exponentially grows with increas...

Journal:
  • Journal of computational biology : a journal of computational molecular cell biology

Volume 12, Issue 7

Pages: -

Publication date: 2005